Regression without regrets: Workflow of initial data analyses
1 Overview
The focus of this document/website is to provide guidance on conducting initial data analysis in a reproducible manner in the context of intended regression analyses.
TODO: to add. create ToC dynamically:
2 IDA Framework
The IDA framework consists of six steps [Huebner et al 2018, Figure 1]. Here we assume that metadata (step I) exist in sufficient detail and that data cleaning (step II) has already been performed. Metadata summarize the background information about the data that is needed to conduct the IDA steps properly, and the data cleaning process identifies and corrects technical errors. Data screening (step III) examines data properties to inform decisions about the intended analysis. Initial data reporting (step IV) documents the insights from the previous steps and can be referred to when interpreting results from the regression modeling. A consequence of these analyses can be that the analysis plan needs to be refined or updated (step V). Finally, reporting of IDA results in research papers (step VI) is necessary to ensure transparency regarding key findings that influence the analysis or the interpretation of results. Further details about the elements of IDA are discussed in [TG3 papers].
IDA framework
References
Huebner M, le Cessie S, Schmidt CO, Vach W. A contemporary conceptual framework for initial data analysis. Observational Studies 2018; 4: 171-192.
Huebner M, Vach W, le Cessie S, Schmidt C, Lusa L. Hidden Analyses: a review of reporting practice and recommendations for more transparent reporting of initial data analyses. BMC Med Res Meth 2020; 20:61.
3 Scope of the regression analyses for the examples
Regression models can be used for a wide range of purposes. For these examples, the assumptions about the regression analysis set-up are listed in Table 1, so that IDA tasks are explained in a well-defined, practically relevant setting. Since a key principle is that IDA does not touch the research question, no associations between dependent (outcome) and independent (non-outcome) variables are considered.
Table 1: The scope of the regression analyses considered for IDA tasks
| Aspects of the research plan | Assumptions in this paper | Reason for the assumption |
|---|---|---|
| Dependent (outcome) variable | One dependent variable that can be continuous or binary; exclude time-to-event or longitudinal outcomes | Explain IDA tasks in a well-defined, practically relevant setting |
| Regression models | Models with linear predictors | Explain IDA tasks in a well-defined, practically relevant setting |
| Purpose of regression model | Adjust effect of one variable of interest for confounders; quantify the effects of explanatory variables on the outcome | Explain IDA tasks in a well-defined, practically relevant setting |
| Independent variables | “explanatory” or “confounder” depending on purpose of model; small to moderate number of mixed types; Not high dimensional; no repeated measurements | To demonstrate IDA approaches for a mix of variables likely to be encountered in practice |
| Statistical analysis plan | Exists, defines the outcome variable, the type of regression model to be used, and a set of independent variables | IDA does not touch the research question, but may lead to an update or refinement of the analysis plan |
References:
Vach W. Regression Models as a Tool in Medical Research. Chapman/Hall CRC 2012
Harrell FE. Regression Modeling Strategies. Springer (2nd ed) 2015
Royston P and Sauerbrei W. Multivariable Model Building. Wiley (2008)
[…]
4 Data screening and possible actions
TODO: Check for copy and paste errors in table.
4.1 Univariate distributions
| | What to look at | Possible actions: Interpretation | Possible actions: SAP | Possible actions: Presentation |
|---|---|---|---|---|
| Continuous variables | General skewness | Help in interpreting results | Update SAP | Update intended presentation of results |
| Continuous variables | General skewness | Wide CI for coefficients | Use variable as log-transformed | Update intended presentation of results |
| Continuous variables | Outliers | Disproportional impact on results | Winsorize or transform | Model involves winsorization |
| Continuous variables | Spike at 0 | Narrow CI at 0 | Use appropriate representation of variable in model | Use 2 (or more) coefficients to distinguish 0 from non-0 continuous part |
| Categorical variables | Frequencies | Comparisons to default reference probably irrelevant | Change reference category | Contrasts compare to (new) reference category |
| Categorical variables | Rare categories | Wide CI for coefficients | Collapse/exclude | Fewer categories to present |
| Categorical variables | One very frequent category | Comparisons irrelevant? | Exclude variable | Variable omitted |
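The "Winsorize or transform" actions in the table can be sketched in a few lines of base R. This is only an illustration on simulated data: the helper name `winsorize` and the 1st/99th percentile cut-offs are assumptions made for the example, not choices prescribed by the analysis plan.

```r
# Hypothetical helper: pull values outside the chosen quantiles back to those quantiles.
winsorize <- function(x, probs = c(0.01, 0.99)) {
  limits <- stats::quantile(x, probs = probs, na.rm = TRUE, names = FALSE)
  pmin(pmax(x, limits[1]), limits[2])  # missing values stay missing
}

set.seed(1)
# Simulated right-skewed variable with one extreme value (not CRASH-2 data)
x <- c(stats::rlnorm(200, meanlog = 1, sdlog = 0.8), 500)

x_wins <- winsorize(x)  # extreme values are capped, all observations are kept
x_log  <- log(x)        # alternative: log transform, only valid for strictly positive values

range(x)
range(x_wins)
```

Winsorizing keeps every observation but limits its leverage, whereas the log transform changes the scale on which coefficients are interpreted, so the intended presentation of results would need to be updated accordingly.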
4.2 Bivariate distributions
| | What to look at | Possible actions: Interpretation | Possible actions: SAP | Possible actions: Presentation |
|---|---|---|---|---|
| Continuous by continuous | Outliers (from the cloud) | Disproportional impact on results | Winsorize or transform | Model involves winsorization |
| Continuous by continuous | Correlations | Wide CI for coefficients | Winsorize or transform | Model involves winsorization |
| Continuous by categorical | Outliers (only visible in bivariate plot) | Wide CI for coefficients | ||
| Categorical by categorical | Frequent/rare combinations | Comparison to default reference irrelevant | Change reference category | Contrasts compare to (new) reference category |
| Categorical by categorical | Frequent/rare combinations | interactions relevant? | Remove interaction from model | Fewer interactions to present |
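For the "Correlations" row, a rank-based correlation matrix is one simple way to screen continuous independent variables. The sketch uses simulated data with made-up variable names borrowed from the example data set; the choice of Spearman correlation and of pairwise complete observations are assumptions for illustration.

```r
set.seed(2)
n <- 300
# Simulated stand-ins for three continuous independent variables (not CRASH-2 data)
dat <- data.frame(
  age = round(runif(n, 16, 90)),
  sbp = round(rnorm(n, 100, 25))
)
dat$hr <- round(110 - 0.2 * dat$sbp + rnorm(n, 0, 15))
dat$sbp[sample(n, 10)] <- NA  # a few missing values, as in real data

# Spearman correlations, using all pairs of observations available for each pair of variables
rho <- stats::cor(dat, method = "spearman", use = "pairwise.complete.obs")
round(rho, 2)
```

Only independent variables enter this matrix; the outcome is deliberately left out, in line with the principle that IDA does not touch the research question.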
4.3 Missing values
| | What to look at | Possible actions: Interpretation | Possible actions: SAP | Possible actions: Presentation |
|---|---|---|---|---|
| Per variable | Number and proportion | Wide CI for coefficients | Remove variable if many missing values | |
| Pattern | Variables missing independently or together | Omit variables together | Changes model | |
| Pattern | Variables missing dependent on levels of other variables | Systematic missingness? Model still based on representative? | IPW needed? | Weighted analysis |
| Complete cases | Number and proportion | Few cases left for main CCO analysis | Multiple imputation (or other way of dealing with missing values)? | Result from MI analysis? Or applicability restricted to a subpopulation? |
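The three views of missingness in the table (per variable, pattern, complete cases) can be computed with base R alone. The following is a minimal sketch on a tiny made-up data frame; the variable names are placeholders.

```r
# Toy data with missing values (not CRASH-2 data)
dat <- data.frame(
  age = c(30, NA, 45, 52, 28, 61),
  sbp = c(110, 95, NA, NA, 120, 100),
  cc  = c(2, 3, NA, 4, 2, NA)
)

# Per variable: number and proportion of missing values
n_miss <- colSums(is.na(dat))
data.frame(
  variable     = names(n_miss),
  n_missing    = n_miss,
  prop_missing = n_miss / nrow(dat),
  row.names    = NULL
)

# Pattern: one string per row, 0 = observed, 1 = missing, in column order
pattern <- apply(is.na(dat) + 0, 1, paste, collapse = "")
table(pattern)

# Complete cases: number and proportion
n_complete <- sum(stats::complete.cases(dat))
n_complete / nrow(dat)
```

Tabulating the pattern strings shows directly whether variables tend to be missing together, which is the information needed for the "Pattern" rows.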
References
Huebner M, le Cessie S, Schmidt CO, Vach W. A contemporary conceptual framework for initial data analysis. Observational Studies 2018; 4: 171-192.
Harrell
[…]
CRASH-2
4.4 Introduction to CRASH-2
Description: Clinical Randomisation of an Antifibrinolytic in Significant Haemorrhage (CRASH-2) was a large randomised placebo-controlled trial among trauma patients with, or at risk of, significant haemorrhage, of the effects of antifibrinolytic treatment on death and transfusion requirement. The study is described at the original trial website. A public version of the data set is available from a repository of public data sets hosted by Vanderbilt University’s Department of Biostatistics (Prof. Frank Harrell Jr.).
The data set includes 20,207 patients and 44 variables.
Hypothetical research aim for IDA: Develop a multivariable model for early death (death within 28 days from injury) using 9 independent variables of mixed type (continuous, categorical, semicontinuous) with the primary aim of prediction and a secondary aim of describing the association of each variable with the outcome. The assumed analysis aim is in line with the prediction model presented by Perel et al, BMJ 2012, supplement available at. However, in contrast to the analysis described there, variables describing the economic region and the treatment allocation are missing in the public version of the data set, and while the data set contains 20,207 patients, the research paper mentions 20,127 patients having been included in the study.
4.5 Crash2 dataset contents
4.5.1 Source dataset
Display the source dataset contents. The dataset is in the data-raw folder of the project directory.
TODO: Move the contents of the original data set to an appendix? IS it relevant for us?
Data frame:crash2
20207 observations and 44 variables, maximum # NAs: 17121
| Name | Labels | Units | Levels | Class | Storage | NAs |
|---|---|---|---|---|---|---|
| entryid | Unique Numbers for Entry Forms | | | integer | integer | 0 |
| source | Method of Transmission of Entry Form to CC | | 5 | | integer | 0 |
| trandomised | Date of Randomization | | | Date | double | 0 |
| outcomeid | Unique Number From Outcome Database | | | integer | integer | 80 |
| sex | | | 2 | | integer | 1 |
| age | | | | | integer | 4 |
| injurytime | Hours Since Injury | | | numeric | double | 11 |
| injurytype | | | 3 | | integer | 0 |
| sbp | Systolic Blood Pressure | mmHg | | integer | integer | 320 |
| rr | Respiratory Rate | /min | | integer | integer | 191 |
| cc | Central Capillary Refille Time | s | | integer | integer | 611 |
| hr | Heart Rate | /min | | integer | integer | 137 |
| gcseye | Glasgow Coma Score Eye Opening | | | integer | integer | 732 |
| gcsmotor | Glasgow Coma Score Motor Response | | | integer | integer | 732 |
| gcsverbal | Glasgow Coma Score Verbal Response | | | integer | integer | 735 |
| gcs | Glasgow Coma Score Total | | | integer | integer | 23 |
| ddeath | Date of Death | | | Date | double | 17121 |
| cause | Main Cause of Death | | 7 | | integer | 17118 |
| scauseother | Description of Other Cause of Death | | 227 | | integer | 0 |
| status | Status of Patient at Outcome if Alive | | 3 | | integer | 3169 |
| ddischarge | Date of discharge, transfer to other hospital or day 28 from randomization | | | Date | double | 3185 |
| condition | Condition of Patient at Outcome if Alive | | 5 | | integer | 3251 |
| ndaysicu | Number of Days Spent in ICU | | | numeric | double | 182 |
| bheadinj | Significant Head Injury | | | integer | integer | 80 |
| bneuro | Neurosurgery Done | | | integer | integer | 80 |
| bchest | Chest Surgery Done | | | integer | integer | 80 |
| babdomen | Abdominal Surgery Done | | | integer | integer | 80 |
| bpelvis | Pelvis Surgery Done | | | integer | integer | 80 |
| bpe | Pulmonary Embolism | | | integer | integer | 80 |
| bdvt | Deep Vein Thrombosis | | | integer | integer | 80 |
| bstroke | Stroke | | | integer | integer | 80 |
| bbleed | Surgery for Bleeding | | | integer | integer | 80 |
| bmi | Myocardial Infarction | | | integer | integer | 80 |
| bgi | Gastrointestinal Bleeding | | | integer | integer | 80 |
| bloading | Complete Loading Dose of Trial Drug Given | | | integer | integer | 80 |
| bmaint | Complete Maintenance Dose of Trial Drug Given | | | integer | integer | 80 |
| btransf | Blood Products Transfusion | | | integer | integer | 80 |
| ncell | Number of Units of Red Call Products Transfused | | | numeric | double | 9963 |
| nplasma | Number of Units of Fresh Frozen Plasma Transfused | | | integer | integer | 9964 |
| nplatelets | Number of Units of Platelets Transfused | | | integer | integer | 9964 |
| ncryo | Number of Units of Cryoprecipitate Transfused | | | integer | integer | 9964 |
| bvii | Recombinant Factor VIIa Given | | | integer | integer | 374 |
| boxid | Treatment Box Number | | | integer | integer | 0 |
| packnum | Treatment Pack Number | | | integer | integer | 0 |
| Variable | Levels |
|---|---|
| source | telephone |
| | telephone entered manually |
| | electronic CRF by email |
| | paper CRF enteredd in electronic CRF |
| | electronic CRF |
| sex | male |
| | female |
| injurytype | blunt |
| | penetrating |
| | blunt and penetrating |
| cause | bleeding |
| | head injury |
| | myocardial infarction |
| | stroke |
| | pulmonary embolism |
| | multi organ failure |
| | other |
| scauseother | |
| | Acute Hypoxia |
| | ACUTE LUNG INJURY |
| | Acute Pulmonary Oedema |
| | Acute Renal Failure |
| | ACUTE RESPIRATORY DISTRESS SYNDROME (ARDS) |
| | acute respiratory failure |
| | acute respiratory failure+sepsis |
| | air amboli (embolism) |
| | Air embolism caused by penetrating lung trauma |
| | ... |
| status | discharged |
| | still in hospital |
| | transferred to other hospital |
| condition | no symptoms |
| | minor symptoms |
| | some restriction in lifestyle but independent |
| | dependent, but not requiring constant attention |
| | fully dependent, requiring attention day and night |
4.5.2 Updated analysis dataset
Additional metadata are added to the original source data set, and the modified data set is written back to the data folder. Metadata are added for the following variables:
- age - add label “Age” and unit “years”.
- injury time - add unit “hours”.
- total Glasgow coma score - add unit “points”.
TODO:
- Do we want to select patients at this point or leave this for the analysis phase?
- Do we also want to do a selection of variables here to take in to the IDA phase? i.e. drop variables we do not check in IDA?
As a cross check we display the contents again to ensure the additional data is added, and then write back the changes to the data folder in the file “data/a_crash2.rds”.
## Generate a derived dataset stored in data, as we are adding to the original source dataset obtained.
library(dplyr) # provides %>% and select()
## Select candidate predictor variables. -- See SAP
crash2_subset <-
  crash2 %>%
  dplyr::select(
    entryid,     # patient identifier
    trandomised, # date of randomisation
    ddeath,      # date of death
    age,         # Age (`age`, years)
    sex,         # Sex (`sex`, male or female)
    sbp,         # Systolic blood pressure (`sbp`, mmHg)
    hr,          # Heart rate (`hr`, 1/min)
    rr,          # Respiratory rate (`rr`, 1/min)
    gcs,         # Glasgow coma score (`gcs`, points)
    cc,          # Central capillary refill time (`cc`, seconds)
    injurytime,  # Time since injury (`injurytime`, hours)
    injurytype   # Type of injury (`injurytype`, 'blunt', 'penetrating' or 'blunt and penetrating')
  )
## Complete metadata by adding missing labels.
## NOTE: time2death is only derived further below, so it is not yet a column of crash2_subset at this point.
a_crash2 <- Hmisc::upData(
  crash2_subset,
  labels = c(
    age = "Age",
    sex = "Sex",
    injurytype = "Injury type",
    time2death = "Time from randomization to day of death"),
  units = c(
    age = "years",
    injurytime = "hours",
    gcs = "points",
    time2death = "days"
  )
)
Input object size: 1221480 bytes; 12 variables 20207 observations
New object size: 1223272 bytes; 12 variables 20207 observations
## Derive outcome variable
a_crash2$time2death <-
  as.numeric(as.Date(a_crash2$ddeath) - as.Date(a_crash2$trandomised))
a_crash2$earlydeath[!is.na(a_crash2$time2death)] <-
  (a_crash2$time2death[!is.na(a_crash2$time2death)] <= 28) + 0
# +0 to transform it from TRUE/FALSE to 1/0
# NA in time2death means alive at day 28
a_crash2$earlydeath[is.na(a_crash2$time2death)] <- 0
## Add metadata
a_crash2 <- Hmisc::upData(a_crash2,
                          labels = c(earlydeath = "Death within 28 days from injury"))
Input object size: 1546808 bytes; 14 variables 20207 observations
New object size: 1385720 bytes; 14 variables 20207 observations
Data frame:a_crash2
20207 observations and 14 variables, maximum # NAs: 17121
| Name | Labels | Units | Levels | Class | Storage | NAs |
|---|---|---|---|---|---|---|
| entryid | Unique Numbers for Entry Forms | | | integer | integer | 0 |
| trandomised | Date of Randomization | | | Date | double | 0 |
| ddeath | Date of Death | | | Date | double | 17121 |
| age | Age | years | | integer | integer | 4 |
| sex | Sex | | 2 | | integer | 1 |
| sbp | Systolic Blood Pressure | mmHg | | integer | integer | 320 |
| hr | Heart Rate | /min | | integer | integer | 137 |
| rr | Respiratory Rate | /min | | integer | integer | 191 |
| gcs | Glasgow Coma Score Total | points | | integer | integer | 23 |
| cc | Central Capillary Refille Time | s | | integer | integer | 611 |
| injurytime | Hours Since Injury | hours | | numeric | double | 11 |
| injurytype | Injury type | | 3 | | integer | 0 |
| time2death | | | | | integer | 17121 |
| earlydeath | Death within 28 days from injury | | | integer | integer | 0 |
| Variable | Levels |
|---|---|
| sex | male |
| | female |
| injurytype | blunt |
| | penetrating |
| | blunt and penetrating |
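Writing the updated data set back and checking that it reads in unchanged could look like the sketch below. To keep the example self-contained it uses a small stand-in data frame (`a_crash2_demo`, a hypothetical name) and a temporary directory; in the project itself the target would be the file “data/a_crash2.rds” mentioned above.

```r
# Stand-in for the analysis data set (hypothetical toy data)
a_crash2_demo <- data.frame(
  entryid = 1:3,
  age     = c(25L, 40L, NA)
)

# Assumption: a temporary directory replaces the project's data folder here
out_file <- file.path(tempdir(), "a_crash2.rds")

saveRDS(a_crash2_demo, file = out_file)  # write the data set
reloaded <- readRDS(out_file)            # read it back as a cross-check

identical(a_crash2_demo, reloaded)
```

The RDS format stores a single R object including its attributes, which is why it is a reasonable choice when labels and units attached to the columns should survive the round trip.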
5 Statistical analysis plan
5.1 Outcome
Early death, i.e. in-hospital death within 28 days from injury (binary variable)
5.2 Statistical methods
Logistic regression will be used to model early death by the following independent variables (measured at randomisation) deemed important to predict early death.
Demographic measurements:
- Age (`age`, years)
- Sex (`sex`, male or female)
Physiological measurements:
- Systolic blood pressure (`sbp`, mmHg)
- Heart rate (`hr`, 1/min)
- Respiratory rate (`rr`, 1/min)
- Glasgow coma score (`gcs`, points)
- Central capillary refill time (`cc`, seconds)
Characteristics of injury measurements:
- Time since injury (`injurytime`, hours)
- Type of injury (`injurytype`, ‘blunt’, ‘penetrating’ or ‘blunt and penetrating’)
Restricted cubic splines with 3 degrees of freedom and knots at their default positions will be used for continuous variables. As the final prediction model should be parsimonious enough to simplify its application, a backward elimination algorithm with a significance level of \(\alpha=0.05\) will be applied to remove effects that are not statistically significant. Finally, the nonlinear representation of each continuous variable will be tested against a linear representation at \(\alpha=0.05\). If the nonlinear effect adds no value for a variable, the model will be refitted with a linear effect for that variable.
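As a rough illustration of the spline part of this plan, the sketch below fits a logistic model on simulated data and tests a nonlinear against a linear representation of one variable. Several things are assumptions made for the example: natural cubic splines from the base `splines` package stand in for restricted cubic splines (default knot positions may differ from those of other software), a likelihood ratio test is used for the comparison, the data are simulated, and backward elimination is not shown.

```r
library(splines)

set.seed(3)
n   <- 2000
age <- runif(n, 16, 90)
sbp <- rnorm(n, 100, 25)
# Simulated truth: linear effect of age, U-shaped effect of sbp (illustration only)
lp  <- -3 + 0.03 * age + 0.0008 * (sbp - 100)^2
dat <- data.frame(y = rbinom(n, 1, plogis(lp)), age = age, sbp = sbp)

# Both continuous variables represented by natural cubic splines with 3 df
fit_spline <- glm(y ~ ns(age, df = 3) + ns(sbp, df = 3),
                  family = binomial, data = dat)

# Same model with sbp entered linearly (nested in the spline model)
fit_linear_sbp <- glm(y ~ ns(age, df = 3) + sbp,
                      family = binomial, data = dat)

# Test of nonlinearity for sbp: 2 extra degrees of freedom
lrt <- anova(fit_linear_sbp, fit_spline, test = "Chisq")
lrt
```

Because linear functions are contained in the natural spline space, the two models are nested and the comparison is a valid test of the nonlinear part of the effect.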
5.3 Remarks
Regarding type of injury, the original paper describes its treatment in the model as follows: ‘Type of injury had three categories—-penetrating, blunt, or blunt and penetrating—but we analysed it as ’penetrating’ or ‘blunt and penetrating.’ ’ It is not clear from that description what happened to the ‘blunt’ group. (I assume they were collapsed with ‘blunt and penetrating’.)
The original paper describes the modeling approach as follows: ‘We used a backward step-wise approach. Firstly, we included all potential prognostic factors and interaction terms that users considered plausible. These interactions included all potential predictors with type of injury, time since injury, and age. We then removed, one at a time, terms for which we found no strong evidence of an association, judged according to the P values (<0.05) from the Wald test.’ This would mean they tested at least 24 interaction terms, each possibly using several degrees of freedom! In the final model, only an interaction of Glasgow coma score and type of injury was included.
5.4 Preparations
The outcome variable, early death (i.e., death within 28 days from injury), must be computed from the time span between date of randomization and date of death using the following logic:
- convert ddeath and trandomised into a date format and then compute the difference in days
- interpret missing values (i.e. NAs) as ‘not died within study period, at least not within 28 days’
- if patients died after 28 days, treat as alive
This can be derived using the following code:
## NOTE: This is for demonstration purposes; this code is not run here.
## The derivation was executed earlier.
a_crash2$time2death <-
  as.numeric(as.Date(a_crash2$ddeath) - as.Date(a_crash2$trandomised))
a_crash2$earlydeath[!is.na(a_crash2$time2death)] <-
  (a_crash2$time2death[!is.na(a_crash2$time2death)] <= 28) + 0
# +0 to transform it from TRUE/FALSE to 1/0
# NA in time2death means alive at day 28
a_crash2$earlydeath[is.na(a_crash2$time2death)] <- 0
| Characteristic | N = 20207¹ |
|---|---|
| Death within 28 days from injury | 3076 (15%) |
¹ Statistics presented: n (%)
The number of deaths computed in the data set coincides with the number reported in Perel et al, BMJ 2012.
5.5 Sources
Data obtained from http://biostat.mc.Vanderbilt.edu/dataSets
5.5.1 Data dictionary
The data dictionary can be found LINK
5.6 References
CRASH-2 Collaborators. Effects of tranexamic acid on death, vascular occlusive events, and blood transfusion in trauma patients with significant haemorrhage (CRASH-2): a randomised, placebo-controlled trial. Lancet 2010;376:23-32
Perel P, Prieto-Merino D, Shakur H, Clayton T, Lecky F, Bouamra O, Russell R, Faulkner M, Steyerberg EW, Roberts I. Predicting early death in patients with traumatic bleeding: development and validation of prognostic model. BMJ 2012; 345(aug15 1): e5166.
5.7 Session info
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17763)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] gtsummary_1.2.6 Hmisc_4.4-0 Formula_1.2-3 survival_3.2-3
## [5] lattice_0.20-40 forcats_0.5.0 stringr_1.4.0 dplyr_0.8.5
## [9] purrr_0.3.4 readr_1.3.1 tidyr_1.0.2 tibble_3.0.1
## [13] ggplot2_3.3.0 tidyverse_1.3.0 here_0.1
##
## loaded via a namespace (and not attached):
## [1] httr_1.4.1 sass_0.2.0 jsonlite_1.6.1
## [4] splines_3.6.1 modelr_0.1.6 assertthat_0.2.1
## [7] latticeExtra_0.6-29 cellranger_1.1.0 yaml_2.2.1
## [10] pillar_1.4.4 backports_1.1.7 glue_1.4.1
## [13] digest_0.6.25 RColorBrewer_1.1-2 checkmate_2.0.0
## [16] rvest_0.3.5 colorspace_1.4-1 htmltools_0.4.0
## [19] Matrix_1.2-18 pkgconfig_2.0.3 broom_0.5.5
## [22] haven_2.2.0 bookdown_0.18 scales_1.1.1
## [25] jpeg_0.1-8.1 htmlTable_1.13.3 generics_0.0.2
## [28] ellipsis_0.3.0 withr_2.2.0 nnet_7.3-13
## [31] cli_2.0.2 magrittr_1.5 crayon_1.3.4
## [34] readxl_1.3.1 evaluate_0.14 fs_1.3.2
## [37] fansi_0.4.1 nlme_3.1-145 xml2_1.2.5
## [40] foreign_0.8-76 tools_3.6.1 data.table_1.12.8
## [43] hms_0.5.3 lifecycle_0.2.0 munsell_0.5.0
## [46] reprex_0.3.0 cluster_2.1.0 compiler_3.6.1
## [49] rlang_0.4.6 grid_3.6.1 gt_0.2.0.5
## [52] rstudioapi_0.11 htmlwidgets_1.5.1 base64enc_0.1-3
## [55] rmarkdown_2.1 gtable_0.3.0 DBI_1.1.0
## [58] R6_2.4.1 gridExtra_2.3 lubridate_1.7.4
## [61] knitr_1.28 commonmark_1.7 rprojroot_1.3-2
## [64] stringi_1.4.6 rmdformats_0.3.7 Rcpp_1.0.4.6
## [67] vctrs_0.3.0 rpart_4.1-15 acepack_1.4.1
## [70] png_0.1-7 dbplyr_1.4.2 tidyselect_1.1.0
## [73] xfun_0.12
6 Univariate distributions
Univariate summary CRASH-2 dataset
6.1 Data set overview
Using the Hmisc describe function, an overview of the data set is provided, including histograms of continuous variables.
6.1.1 Demographic variables
TODO: Should we plot the marginal distribution of the outcome?
2 Variables 20207 Observations
age: Age years
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20203 | 4 | 84 | 0.999 | 34.56 | 15.55 | 18 | 19 | 24 | 30 | 43 | 55 | 64 |
sex: Sex
| n | missing | distinct |
|---|---|---|
| 20206 | 1 | 2 |
Value        male  female
Frequency   16935    3271
Proportion  0.838   0.162
6.1.2 Physiological measurements
5 Variables 20207 Observations
sbp: Systolic Blood Pressure mmHg
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 19887 | 320 | 173 | 0.989 | 98.45 | 27.86 | 60 | 70 | 80 | 95 | 110 | 130 | 143 |
hr: Heart Rate /min
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20070 | 137 | 173 | 0.996 | 104.5 | 23.38 | 70 | 80 | 90 | 105 | 120 | 130 | 140 |
rr: Respiratory Rate /min
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20016 | 191 | 68 | 0.99 | 23.06 | 7.052 | 14 | 16 | 20 | 22 | 26 | 30 | 35 |
gcs: Glasgow Coma Score Total points
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20184 | 23 | 13 | 0.863 | 12.47 | 3.594 | 4 | 6 | 11 | 15 | 15 | 15 | 15 |
Value 3 4 5 6 7 8 9 10 11 12 13 14
Frequency 784 520 441 584 733 576 504 663 586 951 1356 2140
Proportion 0.039 0.026 0.022 0.029 0.036 0.029 0.025 0.033 0.029 0.047 0.067 0.106
Value 15
Frequency 10346
Proportion 0.513
cc: Central Capillary Refille Time s
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 19596 | 611 | 20 | 0.945 | 3.267 | 1.67 | 1 | 2 | 2 | 3 | 4 | 5 | 6 |
Value 1 2 3 4 5 6 7 8 9 10 11 12
Frequency 1510 5328 6020 3367 1805 802 268 271 45 139 3 7
Proportion 0.077 0.272 0.307 0.172 0.092 0.041 0.014 0.014 0.002 0.007 0.000 0.000
Value 13 15 16 17 18 20 30 60
Frequency 3 19 3 1 1 2 1 1
Proportion 0.000 0.001 0.000 0.000 0.000 0.000 0.000 0.000
6.1.3 Characteristics of injury
2 Variables 20207 Observations
injurytime: Hours Since Injury hours
| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20196 | 11 | 93 | 0.972 | 2.844 | 2.35 | 0.5 | 1.0 | 1.0 | 2.0 | 4.0 | 6.0 | 7.0 |
injurytype: Injury type
| n | missing | distinct |
|---|---|---|
| 20207 | 0 | 3 |
Value                       blunt            penetrating  blunt and penetrating
Frequency                   11189                   6552                   2466
Proportion                  0.554                  0.324                  0.122
6.2 Categorical plots
A closer examination of the categorical predictors.
6.2.1 Categorical ordinal plots
The Glasgow coma score, an ordinal categorical variable, is also displayed separately.
6.3 Continuous plots
A closer examination of continuous predictors.
There is evidence of digit preference. Explore further with targeted summaries.
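One possible targeted summary for digit preference is a tabulation of the last digit of the recorded values. The sketch below uses simulated readings in which a share of the values is rounded to the nearest 10; the rounding share and all variable names are assumptions for illustration.

```r
set.seed(4)
n <- 1000
sbp_exact <- rnorm(n, 100, 25)

# Assumption for the simulation: about 60% of readings are rounded to the nearest 10
rounded10 <- runif(n) < 0.6
sbp <- ifelse(rounded10, round(sbp_exact / 10) * 10, round(sbp_exact))

# Last digit of each recorded value
last_digit  <- abs(sbp) %% 10
digit_table <- table(factor(last_digit, levels = 0:9))

# Without digit preference each digit would be expected in roughly 10% of values
round(prop.table(digit_table), 3)

# Goodness-of-fit test against equal frequencies of the ten digits
chisq.test(digit_table)
```

A strong excess of zeros (or of zeros and fives) indicates rounding by the person recording the value, which matters for how finely a nonlinear effect of that variable can be resolved.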
More detailed univariate summaries for the variables of interest are also provided below.
6.3.1 Age
Distribution of subject age [years]
6.3.2 Blood pressure
Distribution of SBP
6.3.3 Respiratory rate
Distribution of respiratory rate
6.3.4 Heart rate
Distribution of heart rate
6.3.5 Central capillary refill time
Distribution of Central capillary refill time
6.3.6 Hours since injury
Distribution of hours since injury
6.4 Session info
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17763)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] Hmisc_4.4-0 Formula_1.2-3 survival_3.2-3 lattice_0.20-40
## [5] forcats_0.5.0 stringr_1.4.0 dplyr_0.8.5 purrr_0.3.4
## [9] readr_1.3.1 tidyr_1.0.2 tibble_3.0.1 ggplot2_3.3.0
## [13] tidyverse_1.3.0 here_0.1
##
## loaded via a namespace (and not attached):
## [1] httr_1.4.1 jsonlite_1.6.1 splines_3.6.1
## [4] modelr_0.1.6 assertthat_0.2.1 highr_0.8
## [7] latticeExtra_0.6-29 cellranger_1.1.0 yaml_2.2.1
## [10] pillar_1.4.4 backports_1.1.7 glue_1.4.1
## [13] digest_0.6.25 RColorBrewer_1.1-2 checkmate_2.0.0
## [16] rvest_0.3.5 colorspace_1.4-1 htmltools_0.4.0
## [19] Matrix_1.2-18 pkgconfig_2.0.3 broom_0.5.5
## [22] haven_2.2.0 bookdown_0.18 patchwork_1.0.0
## [25] scales_1.1.1 jpeg_0.1-8.1 htmlTable_1.13.3
## [28] farver_2.0.3 generics_0.0.2 ellipsis_0.3.0
## [31] withr_2.2.0 nnet_7.3-13 cli_2.0.2
## [34] magrittr_1.5 crayon_1.3.4 readxl_1.3.1
## [37] evaluate_0.14 fs_1.3.2 fansi_0.4.1
## [40] nlme_3.1-145 xml2_1.2.5 foreign_0.8-76
## [43] tools_3.6.1 data.table_1.12.8 hms_0.5.3
## [46] lifecycle_0.2.0 munsell_0.5.0 reprex_0.3.0
## [49] cluster_2.1.0 compiler_3.6.1 rlang_0.4.6
## [52] grid_3.6.1 rstudioapi_0.11 htmlwidgets_1.5.1
## [55] base64enc_0.1-3 labeling_0.3 rmarkdown_2.1
## [58] gtable_0.3.0 DBI_1.1.0 R6_2.4.1
## [61] gridExtra_2.3 lubridate_1.7.4 knitr_1.28
## [64] rprojroot_1.3-2 stringi_1.4.6 rmdformats_0.3.7
## [67] Rcpp_1.0.4.6 vctrs_0.3.0 rpart_4.1-15
## [70] acepack_1.4.1 png_0.1-7 dbplyr_1.4.2
## [73] tidyselect_1.1.0 xfun_0.12
7 Bivariate distributions
This code is a continuous by example
7.1 Summary by sex
| | male (N=16935) | female (N=3271) |
|---|---|---|
| Age | ||
| Median | 30.0 | 35.0 |
| Mean | 33.7 | 38.8 |
| SD | 13.6 | 16.8 |
| Q1, Q3 | 23.0, 41.0 | 25.0, 50.0 |
| Range | 1.0 - 99.0 | 15.0 - 96.0 |
| N-Miss | 3 | 1 |
| Heart Rate | ||
| Median | 105.0 | 106.0 |
| Mean | 104.3 | 105.2 |
| SD | 21.2 | 21.0 |
| Q1, Q3 | 90.0, 120.0 | 92.0, 120.0 |
| Range | 3.0 - 198.0 | 3.0 - 220.0 |
| N-Miss | 95 | 42 |
| Respiratory Rate | ||
| Median | 22.0 | 22.0 |
| Mean | 23.1 | 23.0 |
| SD | 6.8 | 6.6 |
| Q1, Q3 | 20.0, 26.0 | 20.0, 26.0 |
| Range | 1.0 - 96.0 | 3.0 - 87.0 |
| N-Miss | 143 | 48 |
| Systolic Blood Pressure | ||
| Median | 95.0 | 90.0 |
| Mean | 98.8 | 96.7 |
| SD | 25.5 | 25.7 |
| Q1, Q3 | 80.0, 110.0 | 80.0, 110.0 |
| Range | 4.0 - 240.0 | 20.0 - 250.0 |
| N-Miss | 267 | 53 |
| Characteristic | male, N = 16935¹ | female, N = 3271¹ | (Missing), N = 1¹ |
|---|---|---|---|
| Age | 34 (13.6) | 39 (16.8) | 30 (NA) |
| Unknown | 3 | 1 | 0 |
| Heart Rate | 104 (21) | 105 (21) | 108 (NA) |
| Unknown | 95 | 42 | 0 |
| Respiratory Rate | 23 (7) | 23 (7) | 22 (NA) |
| Unknown | 143 | 48 | 0 |
| Systolic Blood Pressure | 99 (26) | 97 (26) | 100 (NA) |
| Unknown | 267 | 53 | 0 |
| Central Capillary Refille Time | 3 (2) | 3 (2) | 4 (NA) |
| Unknown | 509 | 102 | 0 |
| Glasgow Coma Score Total | 12 (4) | 13 (3) | 14 (NA) |
| Unknown | 19 | 4 | 0 |
| Hours Since Injury | 2.85 (2.39) | 2.84 (2.67) | 1.00 (NA) |
| Unknown | 10 | 1 | 0 |
| Injury type | |||
| blunt | 8962 (53%) | 2227 (68%) | 0 (0%) |
| penetrating | 5930 (35%) | 621 (19%) | 1 (100%) |
| blunt and penetrating | 2043 (12%) | 423 (13%) | 0 (0%) |
¹ Statistics presented: mean (SD); n (%)
7.2 Continuous variables by sex
7.2.1 Distribution of age by sex
Distribution of age by sex
7.2.2 Distribution of systolic blood pressure by sex
Distribution of systolic blood pressure by sex
7.2.3 Distribution of heart rate by sex
Distribution of heart rate by sex
7.2.4 Distribution of respiratory rate by sex
Distribution of respiratory rate by sex
7.2.5 Distribution of central capillary refill time by sex
Distribution of central capillary refill time by sex
7.3 Age
7.3.1 Continuous
## n
## 1 19887
## n
## 1 320
bigN <- a_crash2 %>% dplyr::filter(!is.na(sbp) & !is.na(age)) %>% tally()
n_miss <- a_crash2 %>% dplyr::filter(is.na(sbp) | is.na(age)) %>% tally()
title <-
paste0("Plot of ", Hmisc::label(a_crash2$age), " and ", Hmisc::label(a_crash2$sbp))
caption <-
paste0(
"n = ",
bigN,
" subjects displayed.\n",
n_miss,
" subjects with a missing value in at least one of the variables."
)
x_axis <- paste0(Hmisc::label(a_crash2$age), " [", Hmisc::units(a_crash2$age), "]")
y_axis <- paste0(Hmisc::label(a_crash2$sbp), " [", Hmisc::units(a_crash2$sbp), "]")
p1 <- a_crash2 %>%
dplyr::filter(!is.na(sbp) & !is.na(age)) %>%
mutate(sbp = as.numeric(sbp),
age = as.numeric(age)) %>%
ggplot(aes(x = sbp, y = age)) +
ylab(x_axis) +
xlab(y_axis) +
labs(
title = title,
caption = caption
) +
geom_point(shape = 16, #size = 0.5,
alpha = 0.5,
color = "firebrick2") +
geom_rug() +
theme_minimal()
p1
7.3.2 Continuous
# In each plot, convert the two plotted (labelled) columns to plain numeric
# SBP vs. age
p1 <- a_crash2 %>%
  filter(!is.na(sbp) & !is.na(age)) %>%
  mutate(sbp = as.numeric(sbp),
         age = as.numeric(age)) %>%
  ggplot(aes(x = sbp, y = age)) +
  geom_point(shape = 16, size = 0.5,
             alpha = 0.5,
             color = "firebrick2") +
  geom_rug() +
  theme_minimal()
# SBP vs. heart rate
p2 <- a_crash2 %>%
  filter(!is.na(sbp) & !is.na(hr)) %>%
  mutate(sbp = as.numeric(sbp),
         hr = as.numeric(hr)) %>%
  ggplot(aes(x = sbp, y = hr)) +
  geom_point(shape = 16, size = 0.5,
             alpha = 0.5,
             color = "firebrick2") +
  geom_rug() +
  theme_minimal()
# SBP vs. respiratory rate
p3 <- a_crash2 %>%
  filter(!is.na(sbp) & !is.na(rr)) %>%
  mutate(sbp = as.numeric(sbp),
         rr = as.numeric(rr)) %>%
  ggplot(aes(x = sbp, y = rr)) +
  geom_point(shape = 16, size = 0.5,
             alpha = 0.5,
             color = "firebrick2") +
  geom_rug() +
  theme_minimal()
# Heart rate vs. age
p4 <- a_crash2 %>%
  filter(!is.na(hr) & !is.na(age)) %>%
  mutate(hr = as.numeric(hr),
         age = as.numeric(age)) %>%
  ggplot(aes(x = hr, y = age)) +
  geom_point(shape = 16, size = 0.5,
             alpha = 0.5,
             color = "firebrick2") +
  geom_rug() +
  theme_minimal()
7.3.3 Continuous3
## Warning: package 'patchwork' was built under R version 3.6.3
## Don't know how to automatically pick scale for object of type labelled/integer. Defaulting to continuous.
## Don't know how to automatically pick scale for object of type labelled/integer. Defaulting to continuous.
## Don't know how to automatically pick scale for object of type labelled/integer. Defaulting to continuous.
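The "labelled/integer" message above is what ggplot2 prints when a column still carries the class that Hmisc labelling adds. The following is a minimal sketch of the conversion, assuming a hand-built toy vector (the values and the name `age_num` are hypothetical, not taken from the CRASH-2 data) so that it runs with base R alone:

```r
# Toy stand-in for a labelled column (hypothetical values).
# Labelling an integer vector with Hmisc gives class c("labelled", "integer");
# the class is set by hand here so the sketch does not need Hmisc.
age <- structure(c(25L, 40L, NA, 61L),
                 class = c("labelled", "integer"),
                 label = "Age")

class(age)

# as.numeric() drops the class and label attributes and returns a plain
# double vector, which can be mapped to a continuous scale directly.
age_num <- as.numeric(age)
class(age_num)
```

Because the conversion discards the label, any axis text that relies on it has to be built from the original column before converting.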
7.4 Scatter plots with a third or fourth variable
Scatter plot of age and RR by sex and injury type.
ggplot(a_crash2, aes(
y = age,
x = rr
)) +
geom_point(shape = 16, size = 0.5,
alpha = 0.5,
color = "firebrick2") +
geom_rug() +
facet_grid(sex ~ injurytype) +
theme_minimal()
## Don't know how to automatically pick scale for object of type labelled/integer. Defaulting to continuous.
## Don't know how to automatically pick scale for object of type labelled/integer. Defaulting to continuous.
## Warning: Removed 195 rows containing missing values (geom_point).
Scatter plot of SBP and RR by sex and injury type.
ggplot(a_crash2, aes(
y = sbp,
x = rr
)) +
geom_point(shape = 16, size = 0.5,
alpha = 0.5,
color = "firebrick2") +
geom_rug() +
facet_grid(sex ~ injurytype) +
theme_minimal()
## Don't know how to automatically pick scale for object of type labelled/integer. Defaulting to continuous.
## Don't know how to automatically pick scale for object of type labelled/integer. Defaulting to continuous.
## Warning: Removed 457 rows containing missing values (geom_point).
7.5 Session info
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17763)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] patchwork_1.0.0 gtsummary_1.2.6 arsenal_3.4.0 Hmisc_4.4-0
## [5] Formula_1.2-3 survival_3.2-3 lattice_0.20-40 summarytools_0.9.6
## [9] janitor_2.0.1 forcats_0.5.0 stringr_1.4.0 dplyr_0.8.5
## [13] purrr_0.3.4 readr_1.3.1 tidyr_1.0.2 tibble_3.0.1
## [17] ggplot2_3.3.0 tidyverse_1.3.0 here_0.1
##
## loaded via a namespace (and not attached):
## [1] nlme_3.1-145 matrixStats_0.56.0 fs_1.3.2
## [4] lubridate_1.7.4 RColorBrewer_1.1-2 httr_1.4.1
## [7] rprojroot_1.3-2 tools_3.6.1 backports_1.1.7
## [10] R6_2.4.1 rpart_4.1-15 lazyeval_0.2.2
## [13] DBI_1.1.0 colorspace_1.4-1 nnet_7.3-13
## [16] withr_2.2.0 tidyselect_1.1.0 gridExtra_2.3
## [19] compiler_3.6.1 cli_2.0.2 rvest_0.3.5
## [22] gt_0.2.0.5 htmlTable_1.13.3 xml2_1.2.5
## [25] plotly_4.9.2.1 labeling_0.3 sass_0.2.0
## [28] bookdown_0.18 scales_1.1.1 checkmate_2.0.0
## [31] commonmark_1.7 digest_0.6.25 foreign_0.8-76
## [34] rmarkdown_2.1 base64enc_0.1-3 jpeg_0.1-8.1
## [37] pkgconfig_2.0.3 htmltools_0.4.0 highr_0.8
## [40] dbplyr_1.4.2 htmlwidgets_1.5.1 rlang_0.4.6
## [43] readxl_1.3.1 rstudioapi_0.11 pryr_0.1.4
## [46] farver_2.0.3 generics_0.0.2 jsonlite_1.6.1
## [49] crosstalk_1.1.0.1 acepack_1.4.1 magrittr_1.5
## [52] rapportools_1.0 Matrix_1.2-18 Rcpp_1.0.4.6
## [55] munsell_0.5.0 fansi_0.4.1 lifecycle_0.2.0
## [58] stringi_1.4.6 yaml_2.2.1 snakecase_0.11.0
## [61] plyr_1.8.6 grid_3.6.1 crayon_1.3.4
## [64] haven_2.2.0 splines_3.6.1 pander_0.6.3
## [67] hms_0.5.3 magick_2.3 knitr_1.28
## [70] pillar_1.4.4 tcltk_3.6.1 codetools_0.2-16
## [73] reprex_0.3.0 glue_1.4.1 evaluate_0.14
## [76] latticeExtra_0.6-29 data.table_1.12.8 modelr_0.1.6
## [79] png_0.1-7 vctrs_0.3.0 rmdformats_0.3.7
## [82] cellranger_1.1.0 gtable_0.3.0 assertthat_0.2.1
## [85] xfun_0.12 broom_0.5.5 viridisLite_0.3.0
## [88] cluster_2.1.0 ellipsis_0.3.0
8 Missing data
TODO: organise
Identify the number of complete cases and the number of patients with missing data.
## Warning: Factor `sex` contains implicit NA, consider using
## `forcats::fct_explicit_na`
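The code behind this step is not shown, so the following is only a rough sketch of how complete cases and patients with missing data could be counted in base R. The data frame `toy` and its values are hypothetical; only the column names follow those used elsewhere in this document.

```r
# Toy data frame with hypothetical values
toy <- data.frame(
  sex = factor(c("male", "female", NA, "male", "female")),
  age = c(34, NA, 51, 28, 45),
  sbp = c(120, 110, NA, 95, 130)
)

# TRUE for rows in which every variable is observed
is_complete <- complete.cases(toy)

n_complete   <- sum(is_complete)    # patients with no missing value
n_incomplete <- sum(!is_complete)   # patients with at least one missing value

# Number of missing values per variable
miss_per_var <- colSums(is.na(toy))

# Optionally make the missing category of `sex` an explicit factor level;
# addNA() is used here as a base R approximation of what the warning suggests.
toy$sex <- addNA(toy$sex)
```

In this toy example three of the five rows are complete, and each variable has exactly one missing value.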
8.1 Session info
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17763)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] naniar_0.5.2 Hmisc_4.4-0 Formula_1.2-3 survival_3.2-3
## [5] lattice_0.20-40 forcats_0.5.0 stringr_1.4.0 dplyr_0.8.5
## [9] purrr_0.3.4 readr_1.3.1 tidyr_1.0.2 tibble_3.0.1
## [13] ggplot2_3.3.0 tidyverse_1.3.0 here_0.1
##
## loaded via a namespace (and not attached):
## [1] viridis_0.5.1 httr_1.4.1 viridisLite_0.3.0
## [4] jsonlite_1.6.1 splines_3.6.1 modelr_0.1.6
## [7] assertthat_0.2.1 latticeExtra_0.6-29 cellranger_1.1.0
## [10] yaml_2.2.1 pillar_1.4.4 backports_1.1.7
## [13] visdat_0.5.3 glue_1.4.1 digest_0.6.25
## [16] RColorBrewer_1.1-2 checkmate_2.0.0 rvest_0.3.5
## [19] colorspace_1.4-1 plyr_1.8.6 htmltools_0.4.0
## [22] Matrix_1.2-18 pkgconfig_2.0.3 broom_0.5.5
## [25] haven_2.2.0 bookdown_0.18 scales_1.1.1
## [28] jpeg_0.1-8.1 htmlTable_1.13.3 farver_2.0.3
## [31] generics_0.0.2 ellipsis_0.3.0 UpSetR_1.4.0
## [34] withr_2.2.0 nnet_7.3-13 cli_2.0.2
## [37] magrittr_1.5 crayon_1.3.4 readxl_1.3.1
## [40] evaluate_0.14 fs_1.3.2 fansi_0.4.1
## [43] nlme_3.1-145 xml2_1.2.5 foreign_0.8-76
## [46] tools_3.6.1 data.table_1.12.8 hms_0.5.3
## [49] lifecycle_0.2.0 munsell_0.5.0 reprex_0.3.0
## [52] cluster_2.1.0 compiler_3.6.1 rlang_0.4.6
## [55] grid_3.6.1 rstudioapi_0.11 htmlwidgets_1.5.1
## [58] labeling_0.3 base64enc_0.1-3 rmarkdown_2.1
## [61] gtable_0.3.0 DBI_1.1.0 R6_2.4.1
## [64] gridExtra_2.3 lubridate_1.7.4 knitr_1.28
## [67] rprojroot_1.3-2 stringi_1.4.6 rmdformats_0.3.7
## [70] Rcpp_1.0.4.6 vctrs_0.3.0 rpart_4.1-15
## [73] acepack_1.4.1 png_0.1-7 dbplyr_1.4.2
## [76] tidyselect_1.1.0 xfun_0.12